2020 Machine Learning class competition @ UT Austin iSchool
Won second place in the ML class competition with 58% accuracy.
VizWiz is a dataset built to help people who are blind overcome their everyday visual challenges through computer vision and AI. It contains data submitted by users of a mobile phone application, each of whom took a picture and (optionally) recorded a spoken question about that picture. Website
When a blind user takes a picture and asks a question about it, there are many reasons the question may turn out to be unanswerable. To tackle this issue, we extracted features from both the images and the questions and built a model to predict whether each question can be answered. VQA Challenge
In the feature extraction step, I use four image-based features: blur value, background color, foreground color, and image tags. I also use three question-based features: key phrases, sentiment value, and the first word of the question.
To make all of the features usable as model inputs, I transform each one separately. I use a one-hot encoder to convert the foreground color, background color, and first word of the question into 0/1 values, and I round the blur value and sentiment value to three decimal places. In addition, I compare the image tags with the key phrases of the question: if any tag of the image appears in the question, the feature value is 1; otherwise it is 0.
I chose the blur feature because an image that is too unclear often cannot be answered, so the blur value should be a useful predictor. Similarly, I chose the color features because some pictures are too dark or too bright to be answered. Among the question-based features, the sentiment value may be useful because the tone of the wording can affect whether an answer is obtainable, and the first word of the question, like What, Where, and Why, can also indicate whether an answer is likely.
Last but not least, I compare the image tags with the key phrases of the question because if the question asks about something that actually appears in the image, it should be easier to answer; a toy sketch of this feature follows.
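As a minimal sketch (the tags and keywords below are made up for illustration, not taken from the dataset), the tag-matching feature works like this:

# Hypothetical example of the tag/key-phrase match feature
image_tags = ['can', 'soda', 'table']          # words from the Vision API tags
question_keywords = ['kind', 'soda', 'can']    # words from the Text Analytics key phrases
keyFeature = 1 if any(tag in question_keywords for tag in image_tags) else 0
print(keyFeature)  # prints 1, because 'soda' (and 'can') appear in both lists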
To train the classification model, I prepared 2000 training examples, 300 validation examples, and 100 test examples. During training, I used cross-validation to estimate each model's overall performance and then checked the accuracy on the validation set.
For the choice of classification model, I tried several classifiers taught in class and compared their accuracy: KNN, decision tree, SVM, Naive Bayes, and a neural network. The results were all similar, with accuracy only around 0.55.
After that, I also tried ensemble methods such as voting and AdaBoost. The voting model performed about the same as the five individual classifiers, but AdaBoost reached 0.58, slightly higher than the others. Therefore, I chose the AdaBoost model to predict on my test data in the end.
# Framework for lab 3: predicting whether a question about an image can be answered
img_dir = "https://ivc.ischool.utexas.edu/VizWiz_visualization_img/"
split = 'train'
#split = 'val'
#split = 'test'
annotation_file = 'https://ivc.ischool.utexas.edu/VizWiz_final/vqa_data/Annotations/%s.json' %(split)
print(annotation_file)
Image-based features
from skimage import io, color
from skimage.transform import resize
import skimage
import skimage.feature as feature
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import numpy as np
import cv2
from google.colab.patches import cv2_imshow

# Microsoft Azure Computer Vision API configuration
subscription_key = '43e430a1bf9443e28c37ef13aad0baf2'
vision_base_url = 'https://southcentralus.api.cognitive.microsoft.com/vision/v1.0'
vision_analyze_url = vision_base_url + '/analyze?'
# Evaluate an image using the Microsoft Vision API
def analyze_image(image_url):
    # Visualize image
    image = io.imread(image_url)
    # plt.imshow(image)
    # plt.axis('off')
    # plt.show()
    # Microsoft API headers, params, etc.
    headers = {'Ocp-Apim-Subscription-key': subscription_key}
    params = {'visualfeatures': 'Adult,Categories,Description,Color,Faces,ImageType,Tags'}
    data = {'url': image_url}
    # Send request, get API response
    response = requests.post(vision_analyze_url, headers=headers, params=params, json=data)
    response.raise_for_status()
    analysis = response.json()
    return analysis
# Calculate a blur value as the variance of the Laplacian
def variance_of_laplacian(img_url):
    image = io.imread(img_url)
    width = 255
    height = 255
    image = resize(image, (width, height))
    greyscale_image = skimage.color.rgb2gray(image)
    fm = round(cv2.Laplacian(greyscale_image, cv2.CV_64F).var() * 50, 3)
    return fm
def extract_image_features(image_url):
    # Get the Azure Computer Vision analysis result for the picture
    data = analyze_image(image_url)
    # Dominant foreground color in the picture
    foreColor = ''
    if len(data['color']['dominantColorForeground']) == 0:
        foreColor = 'no'
    else:
        foreColor = str(data['color']['dominantColorForeground'])
    # Dominant background color in the picture
    backColor = ''
    if len(data['color']['dominantColorBackground']) == 0:
        backColor = 'no'
    else:
        backColor = str(data['color']['dominantColorBackground'])
    # Key tags in the picture, split into individual words
    keyword = []
    for i in range(len(data['tags'])):
        x = data['tags'][i]['name'].split(' ')
        for j in x:
            keyword.append(str(j))
    keyword = np.array(keyword)
    # Blur value of the picture
    blur = variance_of_laplacian(image_url)
    return foreColor, backColor, keyword, blur
Question-based features
def analyze_question(question):
    dic = {"documents": [{"id": 1, "text": question}]}
    # print(question)
    # Azure Text Analytics API configuration
    subscription_key = '43e430a1bf9443e28c37ef13aad0baf2'
    endpoint = 'https://southcentralus.api.cognitive.microsoft.com'
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    # Sentiment score of the question
    sentiment_url = endpoint + "/text/analytics/v2.1/sentiment"
    response = requests.post(sentiment_url, headers=headers, json=dic)
    sentiments = response.json()
    sentimentsValue = round(sentiments['documents'][0]['score'], 3)
    # pprint(sentiments)
    # Key phrases of the question, split into individual words
    key = []
    keyphrase_url = endpoint + "/text/analytics/v2.1/keyphrases"
    response = requests.post(keyphrase_url, headers=headers, json=dic)
    key_phrases = response.json()
    if len(key_phrases['documents'][0]['keyPhrases']) != 0:
        for i in key_phrases['documents'][0]['keyPhrases']:
            x = i.split(' ')
            for j in x:
                key.append(j)
    return sentimentsValue, key
def extract_question_features(question):
    # Get the Azure text analysis result for the question
    sentiments, keyword = analyze_question(question)
    # Extract the first word of the question (e.g. What, Where, Why)
    Qword = question.split(' ')[0]
    return sentiments, keyword, Qword
Extract features and combine them
# Read the file to extract each dataset example with its label
import requests
import numpy as np

split_data = requests.get(annotation_file, allow_redirects=True)
num_VQs = 2000
k = 0
data = split_data.json()
X = []
y = []
foregroundColor = []
backgroundColor = []
questionWord = []
for vq in data[0:num_VQs]:
    # Extract features describing the image
    image_name = vq['image']
    image_url = img_dir + image_name
    forecolor, backcolor, key_vision, blur = extract_image_features(image_url)
    print(forecolor, backcolor, key_vision, blur)
    # Extract features describing the question
    question = vq['question']
    sentiments, key_text, Qword = extract_question_features(question)
    print(sentiments, key_text, Qword)
    # Check whether any image tag also appears in the question's key phrases
    keyFeature = 0
    matchKey = [i for i in key_vision if i in key_text]
    if len(matchKey) > 0:
        keyFeature = 1
    # Create a multimodal feature vector representing both the image and the question
    multimodal_features = np.array([blur, sentiments, keyFeature])
    # Prepare features and labels
    X.append(multimodal_features)
    label = vq['answerable']
    y.append(label)
    print(k, multimodal_features, label)
    k += 1
    # Categorical features are collected separately and one-hot encoded below
    foregroundColor.append(forecolor)
    backgroundColor.append(backcolor)
    questionWord.append(Qword)
    # print(image_name)
    # print(question)
    # print(label)
    # print(multimodal_features)
    # visualize_image(image_url)
One Hot Encoder
from sklearn.preprocessing import OneHotEncoder

def oneHotTransform(Tarray):
    # One-hot encode a single categorical feature column
    enc = OneHotEncoder()
    a = np.reshape(Tarray, (-1, 1))
    enc.fit(a)
    ans = enc.transform(a).toarray()
    return ans

forecolorFeature = oneHotTransform(foregroundColor)
backcolorFeature = oneHotTransform(backgroundColor)
QwordFeature = oneHotTransform(questionWord)
# Stack the numeric features with the one-hot encoded categorical features
X = np.concatenate((X, forecolorFeature, backcolorFeature, QwordFeature), axis=1)
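The model code below assumes that X_train/Y_train, X_val/Y_val, and X_test_reduced already exist, but the split itself is not shown above. A minimal sketch of one way to produce them, assuming the extraction and encoding cells are rerun once per split using the split variable defined at the top ('train', 'val', 'test'), with the resulting arrays saved each time; the assignments here are hypothetical:

# Hypothetical sketch: collect the features once per dataset split.
# Rerun the cells above with split = 'train' (2000 examples), then 'val' (300), then 'test' (100),
# saving the arrays each time. Note the one-hot columns must line up across splits
# (e.g. by fitting the encoders on the same category sets).
X_train, Y_train = np.array(X), np.array(y)    # after running with split = 'train'
# X_val, Y_val   = np.array(X), np.array(y)    # after running with split = 'val'
# X_test         = np.array(X)                 # after running with split = 'test'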
PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=25)
pca.fit(X_train)
X_train_reduced = pca.transform(X_train)
X_val_reduced = pca.transform(X_val)
X_test_reduced = pca.transform(X_test)
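The notebook does not show how n_components=25 was chosen; as a quick check (not in the original notebook), the variance retained by the 25 components can be inspected after fitting:

# Fraction of the total variance kept by the 25 principal components (sanity check)
print(pca.explained_variance_ratio_.sum())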
KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import classification_report

training_precision_manhattan = []
training_precision_euclidean = []
best_precision = 0
# Grid search over distance metric (p=1 Manhattan, p=2 Euclidean) and number of neighbors
for i in range(1, 3):
    neighbor_setting = range(3, 20)
    for curKvalue in neighbor_setting:
        knn_clf = KNeighborsClassifier(n_neighbors=curKvalue, p=i)
        kfold_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
        fold_train_precision = cross_val_score(knn_clf, X_train_reduced, Y_train, cv=kfold_shuffled, scoring='precision')
        cur_train_precision = fold_train_precision.mean()
        if cur_train_precision > best_precision:
            best_param = {'p': i, 'n_neighbors': curKvalue}
            best_precision = cur_train_precision
        if i == 1:
            training_precision_manhattan.append(cur_train_precision)
        else:
            training_precision_euclidean.append(cur_train_precision)

# Refit with the best hyperparameters and evaluate on the validation set
knn_clf = KNeighborsClassifier(**best_param)
knn_clf.fit(X_train_reduced, Y_train)
knn_pred = knn_clf.predict(X_val_reduced)
print(classification_report(Y_val, knn_pred))
Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

training_precision_gini = []
training_precision_entropy = []
best_precision = 0
# Grid search over split criterion and maximum tree depth
for i in ['gini', 'entropy']:
    tree_setting = range(3, 20)
    for value in tree_setting:
        tree_clf = DecisionTreeClassifier(criterion=i, max_depth=value)
        kfold_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        fold_train_precision = cross_val_score(tree_clf, X_train_reduced, Y_train, cv=kfold_shuffled, scoring='precision')
        cur_train_precision = fold_train_precision.mean()
        if cur_train_precision > best_precision:
            best_param = {'criterion': i, 'max_depth': value}
            best_precision = cur_train_precision
        if i == 'gini':
            training_precision_gini.append(cur_train_precision)
        else:
            training_precision_entropy.append(cur_train_precision)

# Refit with the best hyperparameters and evaluate on the validation set
tree_clf = DecisionTreeClassifier(**best_param)
tree_clf.fit(X_train_reduced, Y_train)
tree_pred = tree_clf.predict(X_val_reduced)
print(classification_report(Y_val, tree_pred))
SVM
from sklearn.svm import SVC

# Two manual gamma values close to sklearn's 'scale' heuristic (1 / (n_features * X.var()))
G1 = 1 / (X_train_reduced.var() * X_train_reduced[1].size) + 0.00001
G2 = 1 / (X_train_reduced.var() * X_train_reduced[1].size) + 0.00002
best_precision = 0
# Grid search over polynomial degree, regularization C, and gamma
for curD in range(2, 6):
    for curC in [0.1, 1, 10]:
        for curG in ['scale', G1, G2]:
            param = {'C': curC, 'degree': curD, 'gamma': curG}
            print(param)
            svm_clf = SVC(kernel='poly', degree=curD, C=curC, gamma=curG)
            kfold_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
            fold_train_precision = cross_val_score(svm_clf, X_train_reduced, Y_train, cv=kfold_shuffled, scoring='precision')
            precision = fold_train_precision.mean()
            if precision > best_precision:
                best_param = {'C': curC, 'degree': curD, 'gamma': curG}
                best_precision = precision

# Refit a polynomial-kernel SVM (the kernel used in the search) with the best hyperparameters
svm_clf = SVC(kernel='poly', **best_param)
svm_clf.fit(X_train_reduced, Y_train)
svm_pred = svm_clf.predict(X_val_reduced)
print(classification_report(Y_val, svm_pred))
Naive Bayes
from sklearn.naive_bayes import GaussianNB
gaussian_model = GaussianNB()
gaussian_model.fit(X_train_reduced, Y_train)
bayes_pred = gaussian_model.predict(X_val_reduced)
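The original notebook stops here for Naive Bayes; for consistency with the other classifiers, the same validation report could be printed (assuming the classification_report import from the KNN section):

print(classification_report(Y_val, bayes_pred))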
Neural Network
from sklearn.neural_network import MLPClassifier

d_hidden = []
d_nn_acc = []
# Try one to five hidden layers with 64 to 320 nodes per layer
for i in range(1, 6):
    n_hidden_nodes = 64 * i
    acc = []
    for j in range(1, 6):
        layers = [n_hidden_nodes] * j
        mlp = MLPClassifier(activation='tanh', hidden_layer_sizes=layers, max_iter=20, verbose=False)
        mlp.fit(X_train_reduced, Y_train)
        acc.append(mlp.score(X_val_reduced, Y_val))
        print(n_hidden_nodes, j, mlp.loss_, mlp.score(X_val_reduced, Y_val))
    d_hidden.append(i)
    d_nn_acc.append(acc)

# Refit the chosen architecture (five layers of 320 nodes) and evaluate on the validation set
n_hidden_nodes = 320
layers = [n_hidden_nodes] * 5
mlp = MLPClassifier(activation='tanh', hidden_layer_sizes=layers, max_iter=20, verbose=False)
mlp.fit(X_train_reduced, Y_train)
acc = mlp.score(X_val_reduced, Y_val)
mlp_pred = mlp.predict(X_val_reduced)
Voting
from sklearn.ensemble import VotingClassifier

# Hard-voting ensemble of the five classifiers tuned above
eclf = VotingClassifier(estimators=[('knn', knn_clf), ('dt', tree_clf), ('svm', svm_clf), ('nb', gaussian_model), ('nn', mlp)], voting='hard')
eclf.fit(X_train_reduced, Y_train)
vote_pred = eclf.predict(X_val_reduced)
print(classification_report(Y_val, vote_pred))
AdaBoost
from sklearn.ensemble import AdaBoostClassifier

adabooster = AdaBoostClassifier(n_estimators=40)
adabooster.fit(X_train_reduced, Y_train)
adabooster_pred = adabooster.predict(X_val_reduced)
print(classification_report(Y_val, adabooster_pred))
import csv

# Predict on the test split and write one prediction per row
predictions = adabooster.predict(X_test_reduced)
# f = open("results.csv", mode="w")
with open("/content/drive/My Drive/Colab Notebooks/results.csv", mode="w") as f:
    results = csv.writer(f)
    for prediction in predictions:
        results.writerow([prediction])